Final Project Report

Author

Ria Krishna (2564944), Rishi Khullar (2547490), Keerthy Rangan (2549740)

Introduction Our project aims to use the World Bank’s datasets on education to compare the enrollment rate across different school levels for the most populated countries in South Asia, East Asia, and Southeast Asia over the past 55 years.

Data Description We used three World Bank datasets focusing on education. Each dataset looks at a different school level and its corresponding enrollment rate including primary, secondary and tertiary. Each dataset also looks at a large range of countries and includes enrollment rates starting from 1960. For the purposes of the scope of our project we are only looking at India, Pakistan, and Bangladesh for the South Asia region, China, Japan, and South Korea for the East Asia region, and Indonesia, the Philippines, and Vietnam for the Southeast Asia region.

Data Analysis Data Cleaning – Python We started our data analysis process by cleaning the dataset. In each dataset for primary, secondary, and tertiary education, the years 1960-1969 had no data. There were also some missing values scattered within the dataset. So, we dropped these years from each of the three datasets, and we also dropped NA values because keeping these would cause issues when we try to create our visualizations. Since we want to analyze enrollment rates for three regions, we had to decide which countries to use for each region. We decided to use the most populated countries in each region. In order to focus on these three countries for each region, we decided to break up each dataset (primary, secondary, and tertiary) into each region. We created a list for the countries in each region – one containing the South Asian countries, one for East Asian countries, and one for Southeast Asian countries. Then, for each dataset we used the “isin” command in Python to filter the dataset to only include countries in the list we created. With that process, we were able to create 9 data tables: Primary South Asia, Secondary South Asia, Tertiary South Asia, Primary East Asia, Secondary East Asia, Tertiary East Asia, Primary Southeast Asia, Secondary Southeast Asia, and Tertiary Southeast Asia. Doing this allowed us to more easily analyze the data we wanted to focus on.

Data Analysis – SQL Once we cleaned the data using python, we imported the datasets into SQL by manually typing at the table values we got in Python. Since each of our tables in Python had enrollment averages of all the years included, it was easy to create the tables without importing the csv files into pgAdmin. We first created the structure of the tables using CREATE TABLE statements, with 5 variables representing the country name, country code, indicator name, indicator code, and average enrollment rate. Then, we inserted our values in the table structure using the INSERT INTO statement, and queried all these statements to officially have our 9 tables set up in SQL. After creating all the 9 tables on SQL, we merged each of the South Asian tables, East Asian tables, and Southeast Asian tables together. For each of the mergings, we didn’t use a specific type of join, as we were merging three tables together for all the regions. Our result was three tables with the countries for a specific region as columns, and the rows being the country name, country code, the average primary school enrollment rate, average secondary school enrollment rate, and average tertiary school enrollment rate. After we queried these merges, we imported the data into csv files, which we used to aid in creating our python graphics.

Data Visualization To create our visualizations we had to further undergo data cleaning and manipulation. In order to graph enrollment rates over time, we had to convert each year as a row to create a year variable and restructure the format of the table. We did this using the melt function in Python to make the data in a long format. Then, we used the matplotlib package and a for loop to plot each country’s enrollment since 1970 distinguished by different colors. We plotted the different countries in each region by primary, secondary, and tertiary. We then averaged the enrollment rates for each year by region and graphed the averages. This allowed us to see how each region’s enrollment rate changed over the years compared to the others.

Results & Discussion In order to visualize our results, we created 12 line graphs in Python. The first three graphs show the primary school enrollment rates from years 1970 to 2020, grouped by region. The first graph compares enrollment rates for countries in East Asia, the second compares enrollment rates for countries in SouthEast Asia, and the third compares rates for countries in South Asia. For the East Asian countries, China consistently has the highest enrollment rates until about 2013, where rates for all countries drop. For the countries in SouthEast Asia, Indonesia and Vietnam show an almost opposite trend in enrollment rates – when the rates increase in Indonesia, they appear to decrease in Vietnam. The trend for the Philippines is similar to that of Indonesia. Vietnam’s trend follows a “W” shape, ending at enrollment rates similar to the start in 1970. For countries in South Asia, all three countries show a similar trend in enrollment rates. They all fluctuate a bit from 1970 to 1990, and then they all increase in a linear fashion. While India shows higher enrollment rates consistently, Bangladesh appears to have the higher enrollment rate today. Something to note for these three graphs is that the enrollment rates for primary school are all above 100% – this is something that is present in the original dataset as well, so it is something to consider exploring or evaluating why this may be the case.

Graphs 4-6 show the enrollment rates for secondary school. The three countries in South Asia seem to follow a similar pattern, with enrollment rates drastically increasing from 1990 to early 2000s, and continuing to increase in a fluctuating manner after. Out of the three countries, India consistently has the highest enrollment rates. The Southeast Asian countries followed a relatively positive linear pattern until the mid 2010s, and their enrollment rates are all about the same in the present day. The East Asian countries each show a very different pattern. Japan’s enrollment rates consistently stay the highest and do not change much over the years, while South Korea’s drastically increases and China’s follows a somewhat sinusoidal pattern and has the lowest enrollment rates, which is a stark difference from the results for primary education.

Graphs 7-9 show the enrollment rates for tertiary education. The countries in East Asia all show a similar pattern of enrollment rates, starting at a very low rate in 1960 and increasing drastically. Enrollment rates for schools in Southeast Asia fluctuate more, but they all converge to a similar rate of about 45%. Schools in South Asia show a very similar pattern – enrollment started very low in 1970 and grew heavily in the 2000s.

Graph 10 shows the enrollment rates for secondary education by year for all nine of the countries we focused on. Based on this graph, Japan has had the highest enrollment rates over the years – however, there was no data present for Japan and South Korea after 2000, so we do not know for certain if they still have the highest secondary education enrollment rates. Out of the countries where we do have the data, Indonesia has the highest enrollment rates as of 2023. Graph 11 shows the enrollment rates for secondary education by region. Based on this graph, the rates for each region generally grow in a linear fashion. Between 1970 and 2000, East Asia has the highest enrollment rates but we do not have data for East Asia beyond that. Between South Asia and Southeast Asia, Southeast Asia has higher enrollment rates as of 2023 and the rates are very high, nearing 100%. Graph 12 shows the enrollment rates for tertiary education by region, and we can see that East Asia consistently has the highest enrollment rates over time, with South Asia and Southeast Asia significantly lower.

Graph 1: Primary School Enrollment Rates for East Asia Primary – East Asia

Graph 2: Primary School Enrollment Rates for Southeast Asia Primary – Southeast Asia

Graph 3: Primary School Enrollment Rates for South Asia Primary – South Asia

Graph 4: Secondary School Enrollment Rates for South Asia Secondary – South Asia

Graph 5: Secondary School Enrollment Rates for Southeast Asia Secondary – Southeast Asia

Graph 6: Secondary School Enrollment Rates for East Asia Secondary – East Asia

Graph 7: Tertiary School Enrollment Rates for East Asia Tertiary – East Asia

Graph 8: Tertiary School Enrollment Rates for Southeast Asia Tertiary – Southeast Asia

Graph 9: Tertiary School Enrollment Rates for South Asia Tertiary – South Asia

Graph 10: Secondary School Enrollment Rates for All Countries Secondary – All Countries

Graph 11: Secondary School Enrollment Rates by Region Secondary – By Region

Graph 12: Tertiary School Enrollment Rates by Region Tertiary – By Region

Graph 13: Average Tertiary Enrollment Rates by Country Tertiary – Average

Conclusion Our project included leveraging different tools across Python and SQL to manage and clean our data as well as provide visualizations. By using the pandas package in Python we created subsets of the large World Bank Data sets to focus solely on three regions in Asia and use matplotlib to plot the relationships between different countries and regions. Our main findings support that, despite starting at a relatively similar enrollment, East Asia has had the largest increase in higher education enrollment rates since 1970 compared to South and Southeast Asia. We believe this partly to be due to different government structures and economic states within each region.